In 2020 the PCCRC funded a project with the following objectives:
In addressing the first objective we decided that a through evaluation of the baselines for Chinook and Chum needed to be done. The genetics lab will be using Rubias instead of Bayes for the genetic stock identification (GSI) analyses and moving from a microsatellite baseline for chum to a SNP baseline.
The purpose of this document is to make the analysis as reproducible as possible and serve as a useful way of collaboration among individuals within the lab.
I don’t have much of the information on tissue sampling, extractions, amplifications so for now I will focus on the statistical analysis.
The chum microsatellite baseline (allele frequencies) can be downloaded from the Division of Fisheries and Oceans Canada (DFO) Molecular Genetics web page (http://www-sci.pac.dfo-mpo.gc.ca/mgl/data_e.htm) [Beacham 2009b, Beacham2009c]. The allele frequencies were converted to allele counts by Guyon et al. (J. Guyon et al. 2010) to create a BAYES (Pella 2001) baseline file. The program rubias (Moran and Anderson 2019) requires genotypes for individuals. Assuming linkage and Hardy-Weinberg equilibirum, I generated individual genotypes by sampling the allele counts without replacement (https://patbarry6.github.io/ABL_GSI/). This will maintain the allele frequencies and sample sizes for each population. There was a single issue where there were an odd number of alleles scored (Locus 2 in population 240). This issue may be the result of conversion from allele fequencies from DFO to allele counts or just a rounding issues from DFO with an allele frequency. My solution was to add one to the most abundant allele. This does not appreciably change the allele frequency and seems more reasonable that doubling the population size to have an even number of alleles to sample. Doing so would give us an increased confidence in our characterization of that population.
Microsatellite allele from the NOAA lab were converted to allele size calls from DFO by running a set of samples from DFO run on the ABL 3130xl. When TSMRI exports their allele calls they need to convert them to the DFO allele sizes before running GSI on their baseline.
The baseline is composed of 381 populations grouped into six regions: Southeast Asia, Northeast Asia, Western Alaska, Upper/Middle Yukon, Southwest Alaska, and the Eastern Gulf of Alaska/Pacific Northwest (C. A. V. Kondzela C.M. AND Marvin and Guyon 2013). These groupings were selected based on principal coordinate analysis (PCO), NJ trees with Cavalli-Sforza and Edwards chord distances [Beacham2009], and simulations were conducted in SPAM software (100%, equal, and realistic fishery mixture proportions). I wonder if we will be able to split out the E GOA and PNW populations with the new SNP baseline. That might open up a giant can of political worms.
The six regional reporting groups used for GSI for chum salmon
So let’s read in the chum microsatellite baseline formatted for rubias.
ChumUsatBase<-read.csv("./MasterData/ChumBaseline_rubias.csv",stringsAsFactors = F)
RepGroups_ChumUsat<-ChumUsatBase%>%
group_by(repunit)%>%
summarise(nPops = length(unique(collection)),
nInd = length(unique(indiv)))
knitr::kable(RepGroups_ChumUsat)
| repunit | nPops | nInd |
|---|---|---|
| E_GOP_PNW | 242 | 32767 |
| EAsia | 31 | 2916 |
| NAsia | 31 | 2804 |
| SW_Alaska | 17 | 1636 |
| UpMidYukon | 23 | 4431 |
| WAlaska | 37 | 6661 |
For the reporting groups the size of the GOA/PNW greatly exceeds that of the other groups. And Southwest AK has the fewest number of collections. Let’s take a closer look at how man indiviuals represent each collection.
CollectionInfo_ChumUsats<-ChumUsatBase%>%
group_by(repunit,collection)%>%
summarise(nInd = length(unique(indiv)))
CollectionInfo_ChumUsats%>%
group_by(repunit)%>%
top_n(10)%>%
knitr::kable()%>%
kableExtra::kable_styling()
| repunit | collection | nInd |
|---|---|---|
| E_GOP_PNW | Clapp_Basin | 362 |
| E_GOP_PNW | Disappearance | 325 |
| E_GOP_PNW | Inch_Creek | 401 |
| E_GOP_PNW | Indian_River | 341 |
| E_GOP_PNW | Kitasoo | 408 |
| E_GOP_PNW | Lagoon_Inlet | 364 |
| E_GOP_PNW | McLoughin_Creek | 377 |
| E_GOP_PNW | Nekite | 486 |
| E_GOP_PNW | Nimpkish | 408 |
| E_GOP_PNW | Squakum | 385 |
| EAsia | Abashiri | 130 |
| EAsia | Chitose | 268 |
| EAsia | Gakko_River | 159 |
| EAsia | Kawabukuro | 115 |
| EAsia | Shibetsu | 110 |
| EAsia | Shiriuchi | 155 |
| EAsia | Tokachi | 129 |
| EAsia | Tokoro | 119 |
| EAsia | Tokushibetsu | 158 |
| EAsia | Yurappu | 160 |
| NAsia | Amur | 339 |
| NAsia | Bolshaya | 97 |
| NAsia | Hairusova | 182 |
| NAsia | Kikchik | 102 |
| NAsia | Naiba | 147 |
| NAsia | Ola_ | 119 |
| NAsia | Ossora | 137 |
| NAsia | Pymta | 99 |
| NAsia | Tugur_River | 98 |
| NAsia | Vorovskaya | 248 |
| SW_Alaska | Alogoshak | 94 |
| SW_Alaska | American_River | 95 |
| SW_Alaska | Delta_Creek | 94 |
| SW_Alaska | Frosty_Creek | 178 |
| SW_Alaska | Gertrude_Creek | 99 |
| SW_Alaska | Joshua_Green | 99 |
| SW_Alaska | Moller_Bay | 95 |
| SW_Alaska | Pumice_Creek | 95 |
| SW_Alaska | Stepovak_Bay | 99 |
| SW_Alaska | Volcano_Bay | 106 |
| UpMidYukon | Big_Creek | 196 |
| UpMidYukon | Chandalar | 225 |
| UpMidYukon | Cheena | 230 |
| UpMidYukon | Fishing_Br | 665 |
| UpMidYukon | Kluane | 535 |
| UpMidYukon | Porcupine | 325 |
| UpMidYukon | Salcha | 182 |
| UpMidYukon | Sheenjek | 256 |
| UpMidYukon | Tatchun | 170 |
| UpMidYukon | Toklat | 245 |
| WAlaska | Andreafsky | 331 |
| WAlaska | Eldorado | 397 |
| WAlaska | Gisasa | 294 |
| WAlaska | Kobuk | 379 |
| WAlaska | Kwiniuk_River | 269 |
| WAlaska | Niukluk | 232 |
| WAlaska | Pikmiktalik | 394 |
| WAlaska | Pilgrim_River | 478 |
| WAlaska | Snake | 399 |
| WAlaska | Tozitna | 353 |
Within the collections the average size is 134.42 (min = 12 and max = 665). Surprisingly, we have pretty good sample sizes for the Russia (NAsia collections).
This baseline has been extremely well vetted. It seems like a small waste of time to run these tests right now, but if we decide to make this a short manuscript it might be a good idea to re-run the analyses just to be on the safe side.
First we need to convert the rubias input file into a genepop input file with a POP designator. This shouldn’t be too difficult. We can just create an index by population and write each population as a group of lines to a text file.
Loci<-colnames(ChumUsatBase)[-(1:4)] %>%
.[c(TRUE,FALSE)]
Title<-"ChumBaseline"
ChumGenos<-sapply(1:(length(Loci)*2), function(x) formatC(ChumUsatBase[,-(1:4)][,x], width=3,
format="d",flag="0"))
colnames(ChumGenos)<-colnames(ChumUsatBase[,-(1:4)])
ChumGenos6<-lapply(Loci, function(x) apply(ChumGenos[,
grep(pattern=x,
x=colnames(ChumGenos))],
MARGIN = 1,
paste,
collapse = "")%>%
as.vector())
ChumGenosMat<-do.call(cbind,ChumGenos6)
ChumGenosMat[ChumGenosMat == " NA NA"]<-"000000"
collections<-unique(ChumUsatBase$collection)
PopIndex<-lapply((1:length(collections)), function(x)
which(ChumUsatBase$collection==collections[x]))
write.table(Title,"ChumBaseline_gp.gen",append=F,quote=F,row.names=F,col.names = F)
write.table(Loci,"ChumBaseline_gp.gen",append=T,quote=F,row.names=F,col.names = F)
for (x in 1:length(collections)){
write.table("POP","ChumBaseline_gp.gen",append=T,quote=F,row.names=F,col.names = F)
TempPop <- data.frame(id=ChumUsatBase[PopIndex[[x]],4],
ChumGenosMat[PopIndex[[x]],],stringsAsFactors = F)
Lines2write <- sapply(1:nrow(TempPop),function(y)
paste(TempPop[y,1], " , " ,
paste(as.character(TempPop[y,-1]),collapse=" "),
sep=""))
write.table(Lines2write,"ChumBaseline_gp.gen",append=T,quote=F,row.names=F,col.names = F)
}
Ok, so now we have a genepop file for the chum microsatellite baseline. Let’s pull it into R with adgenet’s read genepop function and then calculate some Nei’s \(F_{st}\) values. In the markdown document we won’t evaluate this piece everytime becuase the bootstrapping over 300 populations can take some time. Instead we will read in the results at the top of the document.
Chum_uSatBase.genid<-adegenet::read.genepop("./ChumBaseline_gp.gen",ncode=3)
ChumuSatBase_hierf <- hierfstat::genind2hierfstat(Chum_uSatBase.genid)
Chum_uSatFst.Nei<-hierfstat::pairwise.neifst(ChumuSatBase_hierf)
Chum_uSatFst_bs <- hierfstat::boot.ppfst(dat=ChumuSatBase_gdf,nboot=100,quant=c(0.025,0.975),diploid=TRUE)
diag(Chum_uSatFst.Nei)<-0
I originally made a heatmap based on the \(F_{st}\) distances, but it is pretty darn ugly, and a neighbor joining tree uses the same information in a much easier format to digest.
NJ.chumuSat.Nei<-phangorn::NJ(Chum_uSatFst.Nei)
NJ.chumuSat.Nei$tip.label<-unique(gsub(pattern="_[0-9]+$",replacement="",Chum_uSatBase@pop))
PopRepGroups_ChumuSat<-ChumUsatBase%>%
group_by(repunit,collection)%>%
slice(1)%>%
select(Pop = collection ,RepGroup = repunit)%>%
ungroup()
# p + geom_text2(aes(subset=!isTip, label=node)) #use this line to get the node number
# viewClade(pNei, node=612) #use this line to look at a specific node
P_ChumuSat<-ggtree(NJ.chumuSat.Nei) %<+% PopRepGroups_ChumuSat +
geom_tiplab(aes(fill = factor(RepGroup)),
size = 1.75,
color = "black", # color for label font
geom = "label", # labels not text
label.padding = unit(0.15, "lines"), # amount of padding around the labels
label.size = 0.01) +
theme(legend.position = c(0.8,0.2),
legend.title = element_blank(), # no title
legend.key = element_blank())
P_ChumuSat
That looks pretty good. Southeast Asia, Western Alaska, and the Upper/Middle Yukon all look like great groupings. All the collections in the reporting units are included under distinct nodes. The majority of the SW Alaska collections are grouped within a node with GOA/PNW samples; however, there are a few collections from South Bristol Bay that group with the North Asia collections.
Let’s take a closer look at that one
viewClade(P_ChumuSat, node=628)
Ok, so it is a bunch of South Bristol Bay pops that are clustering with the Asia collections. Why would that be? Why wouldn’t we see Asia collections more similar to the Alaska Peninsula pops. The glacial history and the long branch of the Sturgeon River collection from SW Alaska might give us some insight. Sturgeon is on the western part of Kodiak island. The Sturgeon River chum salmon have a unique run timing compared to other chum populations in the Kodiak Archipelago. Sturgeon chum enter the lagoon area earlier (in late May to mid-June) compared to other Kodiak chum salmon stocks (mid-to-late July; (Price 2001)). It is hypothesized that Sturgeon is highly differentiated because the southwest portion of Kodiak island remained ice-free during the last glaical maxima (Petrou et al. 2014) and that the run timing difference has minimized the homogenizing effect of geneflow.
Lets take a look at a principal component analysis (PCA).
X.Chum_uSat <- scaleGen(Chum_uSatBase.genid,NA.method="mean",center=T,scale=T)# can set the missing to NA, 0, or mean
pca.Chum_uSat <- dudi.pca(X.Chum_uSat,cent=F,scale=F,scannf=FALSE) #nf is kept axes, note you can center and scale in this step
barplot(pca.Chum_uSat$eig[1:15],main="PCA eigenvalues", col=heat.colors(15))
PopRGmap<-data.frame(Pop = as.character(unique(pop(Chum_uSatBase.genid))),
RepGroup = as.character(sapply(1:length(unique(pop(Chum_uSatBase.genid))),function(x)
ChumUsatBase[grep(unique(pop(Chum_uSatBase.genid))[x],
x=ChumUsatBase$indiv),2])),
stringsAsFactors = F)
Chum_uSat_PCAdf<-data.frame(Ind = dimnames(Chum_uSatBase.genid@tab)[[1]],
Pop = pop(Chum_uSatBase.genid),
RepGroup = plyr::mapvalues(
x=pop(Chum_uSatBase.genid), from = PopRGmap$Pop,
to = PopRGmap$RepGroup),
pca.Chum_uSat$li,
stringsAsFactors = F)
Chum_uSat_PCAdf$Pop <- gsub(PCAdf$Pop, pattern="_[0-9]+$", replacement = "" )
ggplot(Chum_uSat_PCAdf) +
geom_point( aes(Axis1, Axis2,color=as.factor(RepGroup)),size=2,alpha = 1/2)+
theme(axis.line = element_blank(),
panel.background = element_blank()) +
geom_vline(xintercept = 0)+
geom_hline(yintercept = 0)+
scale_color_discrete(name = "Reporting Group")
Chum_uSat_PCA1_Loads<-data.frame(pca.Chum_uSat$c1[order(pca.Chum_uSat$c1[,1]^2,decreasing=T),],
pca.Chum_uSat$c1[order(pca.Chum_uSat$c1[,1]^2,decreasing=T),]^2,
cumsum(pca.Chum_uSat$c1[order(pca.Chum_uSat$c1[,1]^2,decreasing=T),]^2))
Loads<-head(Chum_uSat_PCA1_Loads,10)%>%
select(Loadings=CS1,Variance=CS1.1,CumSum=CS1.2)
#Let's look at Ots3 allele freq across the rep groups
OtsInfo<-data.frame(Ind = dimnames(Chum_uSatBase.genid@tab)[[1]],
Pop = pop(Chum_uSatBase.genid),
RepGroup = plyr::mapvalues(
x=pop(Chum_uSatBase.genid), from = PopRGmap$Pop,
to = PopRGmap$RepGroup),
Chum_uSatBase.genid@tab[,grep("Ots3",colnames(Chum_uSatBase.genid@tab))]
)
OtsRGcounts<-OtsInfo%>%
group_by(RepGroup)%>%
select_at(-(1:2))%>%
replace(is.na(.), 0)%>%
summarise_all(funs(sum))
OtsRGfreq<-OtsRGcounts%>%
{sapply(2:ncol(.),function(x) .[,x]/sum(.[,x]))}%>%
data.frame(RepGroup = OtsRGcounts$RepGroup,
.)
AllelesInt<-c(106,086,092,094)
OtsRGfreq%>%
.[,c(1,sapply(1:length(AllelesInt),function (x) grep(pattern=AllelesInt[x],x=colnames(OtsRGfreq))))]
## RepGroup Ots3.106 Ots3.086 Ots3.092 Ots3.094
## 1 EAsia 0.03055813 0.0016109886 0.142421470 0.046413502
## 2 NAsia 0.08847546 0.0074190266 0.101250381 0.008077155
## 3 WAlaska 0.34817746 0.0105138206 0.189082037 0.013984328
## 4 UpMidYukon 0.32620591 0.0002119722 0.434583715 0.001928873
## 5 SW_Alaska 0.03720852 0.0066135323 0.004574565 0.002290536
## 6 E_GOP_PNW 0.16937453 0.9736306597 0.128087832 0.927305606
Similar to the NJ tree, the major division among the reporting groups is between the GOA/PNW collections and all other reporting groups (Axis 1). From the loadings it appears that the locus Ots3 is really informative. The second PC separates Asia from Alaska collections.
The leave one out test evaluates how well each fish within the baseline can be assigned to their population or reporting group of origin. The test is conducted by removing fish one at a time from each baseline population and then estimating their collection of origin. These types of tests are very useful to see where the most common type of misallocation might be to. It is most easily viewed with a heatmap displaying the probability of assignment of fish back to each collection or reporting group.
sa_ChumuSats <- self_assign(reference = ChumUsatBase, gen_start_col = 5)
## Summary Statistics:
##
## 51215 Individuals in Sample
##
## 11 Loci: Oki100, Omm1070, Omy1011, One101, One102, One104, One114, Ots103, Ots3, Otsg68, Ssa419
##
## 6 Reporting Units: EAsia, NAsia, WAlaska, UpMidYukon, SW_Alaska, E_GOP_PNW
##
## 381 Collections: Namdae_R, Miomote, Kawabukuro, Hayatsuki, Gakko_River, Uono_River, Ohkawa, Tsugaruishi, Orikasa, Sakari_River, Koizumi_River, Tokachi, Kushiro, Teshio, Chitose, Toshibetsu, Shiriuchi, Shikiu, Yurappu, Shizunai, Tokoro, Abashiri, Horonai, Shari_River, Tokushibetsu, Shibetsu, Nishibetsu, Ryazanovka, Avakumovka, Narva, Suifen, Amur, Naiba, Kalininka, Tym_, Udarnitsa, Tauy, Ola_, Magadan, Okhota, Tugur_River, Oklan, Penzhina, Vorovskaya, Kol_, Pymta, Hairusova, Utka_River, Kikchik, Bolshaya, Plotnikova_R, Kamchatka, Ossora, Nerpichi, Ivashka, Karaga, Dranka, Apuka_River, Olutorsky_Bay, Zhypanova, Anadyr, Kanchalan, Kelly_Lake, Noatak, Imnachuk, Kobuk, Agiapuk, Koyuk, Peel_River, Pikmiktalik, Niukluk, Pilgrim_River, Kwiniuk_River, Snake, Unalakleet, Ungalik, Nome, Shaktoolik, Eldorado, Andreafsky, Chulinak, Tozitna, Anvik, Nulato, Melozitna, Gisasa, Aniak, George, Kanektok, Kasigluk, Kwethluk, Nunsatuk, Goodnews, Stuyahok_River, Mulchatna, Naknek, Togiak, Alagnak, Henshaw_Creek, Jim_River, Koyukuk_south, Koyukuk_late, Cheena, Salcha, Delta, Toklat, Kantishna, Sheenjek, Black_River, Chandalar, Big_Salt, Tatchun, Pelly, Big_Creek, Minto, Kluane, Donjek, Kluane_Lake, Fishing_Br, Porcupine, Teslin, Chandindu, Egegik, Gertrude_Creek, Pumice_Creek, Meshik, Moller_Bay, Joshua_Green, Frosty_Creek, Volcano_Bay, Coleman_Creek, Delta_Creek, Westward_Creek, Stepovak_Bay, Alogoshak, Big_River, American_River, Sturgeon, Uganik, Keta_Creek, Wells_River, Constantine, Olsen_Creek, Fish_Creek, Dipac_Hatchery, Gambier, Wells_Bridge, Herman_Creek, Sawmill, Greens, Kennell, Disappearance, Neets_Bay_late, Neets_Bay_early, Carroll, Nakut_Su, LagoonCr, Takwahoni, Taku, Tuskwa, Yellow_Bluff, Shustnini, Gold_Harbour, Mace_Creek, Botany_Creek, Clapp_Basin, Security, Dawson_Inlet, Mountain_Cr, Fairfax_Inlet, Kano_Inlet_Cr, Steel_Cr, Seal_Inlet_Cr, Thorsen, Pallant, Sedgewick, Lagoon_Inlet, Bag_Harbour, Salmon_River, Little_Goose, Dana_Creek, Pacofi, Surprise, Hutton_Head, Ain_, Awun, Naden, Stanley, Slatechuck_Cre, Government, North_Arm, Tarundl_Creek, Deena, Lagins, Buck_Channel, Honna, Kshwan, Khutzeymateen, Lachmach, Stagoo, Toon, Dak_, Kateen, Kitsault_Riv, Ensheshese, Wilauks_Cr, Illiance, Tseax, Lizard_Cr, Crag_Cr, Stumaun_Cr, Ksedin, Kincolith, Nass_River, Nangeese, Date_Creek, Upper_Kitsumkal, Kitwanga, Ecstall_River, Whitebottom_Cr, Kalum, Dog-tag, Andesite_Cr, Kleanza_Cr, Kispiox, Skeena, Zymagotitz, Wilson_Creek, Markle_Inlet_Cr, Pa-aat_River, Kumealon, Stewart_Cr, Kxngeal_Cr, Klewnuggit_Cr, Bella_Bell, Martin_Riv, Snootli, Cascade, Bella_Coola, Dean_River, Jenny_Bay, Kimsquit, Elcho_Creek, Kemano, Gill_Creek, Turn_Creek, Quaal_River, Kitimat_River, Green_River, Khutze_River, Foch_Creek, Blackrock_Creek, Nias_Creek, West_Arm_Creek, Kitasoo, Clatse_Creek, Quartcha_Creek, McLoughin_Creek, Roscoe_Creek, Mussel_River, Kwakusdis_River, Neekas_Creek, Duthie_Creek, Kainet_River, Barnard, Tyler, Gilttoyee, Lard, Cooper_Inlet, Bullock_Chann, Deer_Pass, Skowquiltz, Frenchman, Hooknose, Nooseseck, Kimsquit_Bay, Bish_Cr, Kiltuish, Salmon_Bay, Cheenis_Lake, Necleetsconnay, East_Arm, Arnoup_Cr, Flux_Cr, Turtle_Cr, MacNair_Creek, Chuckwalla, Clyak, Lockhart-Gordon, Quap, Ashlulm, Draney, Milton, Walkum, Nekite, Klinaklini, Ahta______, Viner_Sound, Waump, Taaltz, Nimpkish, Kakweiken, Glendale, Algard, Ahnuhati, Mackenzie_Sound, Heydon_Cre, Tzoonie, Cheakamus, Sliammon, Mamquam, Wortley_Creek, Squamish, Indian_River, Theodosia, Southgate, Orford, Shovelnose_Cr, Mashiter_Creek, Stawamus, Homathko, Goldstream, Cowichan, Nanaimo, Chemainus, Puntledge, Big_Qual, Little_Qua, Campbell_River, Cold_Creek, Smith_Cree, Demamiel, Nitinat, Hathaway_Creek, Pegattum_Creek, Goodspeed_River, Cayeghle, Colonial, Sugsaw, Nahmint_River, Silverdale, Squakum, Wahleach, Chilliwack, Chehalis, Stave, Alouette, Vedder, Harrison, Inch_Creek, Lower_Lillooet, NorrishWorth, Alouette_North, Widgeon_Slough, Kawkawa, Blaney_Creek, Chilqua_Creek, Serpentine_R, Kanaka_Cr, Worth_Creek, Hopedale_Cr, Hicks_Cr, Harrison_late, Nooksack, Skagit, County_Line, Tulalip, Grant_Creek, Siberia_Creek, SkykomishRiv, Kennedy_Creek, Minter_Cr, GreenRrHatchery, Big_Quilcene, Hoodsport, Salmon_Cr, Elwha, Ellsworth_Cr, Bitter_Creek, Quinault, Satsop
##
## 7.98% of allelic data identified as missing
sa_to_repu_ChumuSats <- sa_ChumuSats %>%
group_by(indiv, collection, repunit, inferred_repunit) %>%
summarise(repu_scaled_like = sum(scaled_likelihood))
RepGroups <- unique(PopRepGroups_ChumuSat$RepGroup)[c(2,3,6,5,4,1)]
sa_mean_scaled_like_ChumuSats <- sa_to_repu_ChumuSats %>%
mutate(repunit_f = factor(x = repunit, levels = RepGroups)) %>% #turn character into factor
mutate(inferred_repunit_f = factor(x = inferred_repunit, levels = RepGroups)) %>% #turn character into factor
group_by(repunit_f, inferred_repunit_f) %>%
summarise(mean_repu_scaled_like = mean(repu_scaled_like)) #take the mean of the scaled liklihoods
#Generate a heatmap
ggplot(sa_mean_scaled_like_ChumuSats,aes(x = repunit_f, y = inferred_repunit_f, z = mean_repu_scaled_like)) +
geom_tile(aes(fill = mean_repu_scaled_like)) +
geom_text(aes(label=sprintf(unlist(sa_mean_scaled_like_ChumuSats[,3]),fmt='%#.2f')))+
scale_fill_gradient(low = "white", high = "red", limits = c(0, 1), name = "Mean Scaled Likelihood\n") +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Reporting Unit") +
ylab("Inferred Reporting Unit")
This is somewhat surprising to me. While the missallocation of SW Alaska samples to the GOA seems reasonable, we have non-trivial misallocation of NE Asia samples to to the GOA/PNW. Let’s take a closer look at the NE Asia baseline samples by subsetting on them, then looking at the collection with the highest likelihood of assignment (we won’t sum over likelihoods for the reporting groups).
MissAlloc<-sa_ChumuSats%>%
filter(repunit == "NAsia")%>%
group_by(indiv,collection)%>%
top_n(n=1,wt=scaled_likelihood)%>%
filter(inferred_repunit != "NAsia")%>%
{table(.$collection, .$inferred_collection)}
rowSums(MissAlloc)
## Amur Anadyr Apuka_River Bolshaya Dranka
## 89 29 18 19 16
## Hairusova Ivashka Kalininka Kamchatka Kanchalan
## 57 16 1 25 22
## Karaga Kikchik Kol_ Magadan Naiba
## 14 39 30 28 39
## Nerpichi Okhota Oklan Ola_ Olutorsky_Bay
## 18 22 10 42 19
## Ossora Penzhina Plotnikova_R Pymta Tauy
## 50 8 27 46 21
## Tugur_River Tym_ Udarnitsa Utka_River Vorovskaya
## 36 20 10 8 86
## Zhypanova
## 10
The Amur and the Vorovskaya collections have an unusually high number of missallocations. This is likely due to really unequal sample sizes in the baseline collections. Lets take a quick look to confirm this.
MissAllocR<-rowSums(MissAlloc) / CollectionInfo_ChumUsats[sapply(1:length(rowSums(MissAlloc)),
function(x)
grep(pattern = names(rowSums(MissAlloc))[x],
x = CollectionInfo_ChumUsats$collection,
)),3]
MissAllocDF<-cbind(names(rowSums(MissAlloc)),MissAllocR)
colnames(MissAllocDF)<-c("Pop","Prop")
ggplot(MissAllocDF, aes(x=Prop))+
geom_histogram()+
geom_vline(xintercept=MissAllocDF[MissAllocDF$Pop%in%c("Amur","Vorovskaya"),2],
lty=2,col="red")
So those two populations aren’t particularly bad, they just have a bunch of samples, the collections that are pretty bad are Nerpichi and Pymta.
This doesn’t look good for individual assignment, but for GSI it may all come out in the wash. Let’s do some 100% simulations.
Proof100_scenarios_ChumuSats <- lapply(RepGroups, function(x) tibble(repunit = x, ppn = 1.0))
names(Proof100_scenarios_ChumuSats) <- paste("All", RepGroups, sep = "-")
Proof100_results_ChumuSats <- assess_reference_loo(reference = ChumUsatBase,
gen_start_col = 5,
reps = 10,
mixsize = 100,
alpha_repunit = Proof100_scenarios,
alpha_collection = 10) #distribution of individuals from collections ~equal
Proof100_results_ChumuSats %>%
mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>% #[c(2,3,6,5,4,1)]
group_by(repunit_scenario, iter, repunit_f) %>%
summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n)) %>%
filter(gsub(pattern="All-","",repunit_scenario) == repunit_f) %>%
ggplot(aes(x = iter, y = repprop_posterior_mean, fill = repunit_f)) +
geom_bar(stat = 'identity') +
geom_hline(yintercept = 0.9) +
scale_fill_discrete(name = "Reporting Group") +
facet_wrap(~ repunit_f) +
xlab("Iteration") +
ylab("Posterior Mean Reporting Group Proportion")
Strange in 100% mixtures we seem to do fine for the NE Asia and SW_Alaska collections but the Upper Middle Yukon looks to have some issues. We probabily are overestimating the Southwest AK proportion in those simulations based on the loo results above.
Proof100_results_ChumuSats%>%
filter(repunit_scenario == "All-UpMidYukon") %>%
group_by(iter, repunit)%>%
summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n)) %>%
mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>%
ggplot(aes(x = repunit, y = repprop_posterior_mean, fill = repunit_f)) +
geom_boxplot() +
geom_hline(yintercept = 0.9,lty=2,color="red") +
scale_fill_discrete(name = "Reporting Group") +
xlab("Iteration") +
ylab("Posterior Mean Reporting Group Proportion") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
So for the 100% simulations for the Upper Middle Yukon, we overestimate the proportion for Western Alaska. We will likely never see 100% of a mixture originating from the Upper/Middle Yukon unless we are sampling a fishery on the river, so lets proceed with some realistic fishery simulations that are taken from the BSAI trawl fishery.
For the realistic fishery simulations we are going to pull the 2017 report for for the 2015 Bering Sea Walleye Pollock Trawl Fishery and Gulf of Alaska Groundfish Fisheries (J. A. V. Kondzela C.M. AND Whittle and Guyon 2017). We will do 10 replicates for mixtures that we might actually see in the fishery.
# Lets instead pull the 2017 analyses that were done and do 10 replicates for
# Mixtures that we might actually see in the fishery
ChumAnalyses<-read_csv("./MasterData/chum_all_years_gsi_summary.csv")
RepGroupMap<-cbind(as.character(RepGroups),
unique(as.character(ChumAnalyses$reporting_group))[c(2,4,3,5,6,1)]
)
ChumAnalyses$reporting_group<-plyr::mapvalues(from = RepGroupMap[,2] ,
to = RepGroupMap[,1],
x= ChumAnalyses$reporting_group )
Chum2017<-ChumAnalyses %>%
filter(year ==2017) %>%
mutate(Analysis = paste(lme,spatialstrat,temporalstrat,sep="_")) %>%
group_by(Analysis)%>%
select(c(Analysis,reporting_group,mean))
RF_scenarios <- lapply(unique(Chum2017$Analysis), function(x)
Chum2017 %>%
filter(Analysis == x) %>%
ungroup()%>%
select(-1)%>%
setNames(.,c("repunit", "ppn" ))
)
names(RF_scenarios) <- ChumAnalyses %>%
filter(year ==2017) %>%
mutate(Analysis = paste(lme,spatialstrat,temporalstrat,sep="_")) %>%
select(Analysis) %>%
unique()%>%
unlist()
RF_results_ChumUsats <- assess_reference_loo(reference = ChumUsatBase,
gen_start_col = 5,
reps = 10,
mixsize = 100,
alpha_repunit = RF_scenarios,
alpha_collection = 10) #distribution of individuals from collections ~equal
RFsum_ChumUsats <- RF_results_ChumUsats %>%
mutate(repunit_f = factor(x = repunit, levels = RepGroups))%>%
group_by(repunit_scenario, iter, repunit_f) %>%
summarise(true_repprop = sum(true_pi), reprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n))
ggplot(RFsum_ChumUsats, aes(x = repu_n_prop, y = reprop_posterior_mean, colour = repunit_f)) +
geom_point() +
scale_color_discrete(name = "Reporting Group")+
geom_abline(intercept = 0, slope = 1)+
geom_abline(intercept = 0.1, slope = 1,lty=2)+
geom_abline(intercept = -0.1, slope = 1,lty=2)+
facet_wrap(~ repunit_f)+
ylab("Posterior Mean Reporting Group Proportion") +
xlab("Simulated Mixture Proportion")
That looks pretty good! Lets see how the SNPs measure up!
There are a few different SNP baselines that ADF&G uses, but the most comprehensive geographically has 91 loci genotyped for 310 populations that are grouped into 9 reporting groups (DeCovich et al. 2012). The main difference between the SNP and microsatellite reporting groups are that the NE and SE Asian reporting groups are merged into a single reporting group and they have broken out some reporting groups from Coastal Western Alaska.
The regional reporting groups for the SNP baseline
So the baseline that I recieved from Kyle Shedd was for 91 loci, but there appears to not be complete overlap with the GT_seq panel that Wes is using for the NOAA panel. So we first need to find the overlap in the two panels.
First lets read in the GTseq_RAD_chum SNPs that Wes sent and then look at the overlap with the ADF&G panel.
GTseq287<-read_csv("./MasterData/GTseq_RAD_chum.csv")%>%
select("Fwd")%>%
unlist()%>%
gsub(pattern="_F",replacement="",.)%>%
as.vector()%>%
tolower()
ADFG_Panel<-read_lines("./MasterData/WASSIP310Pops91loci.gen")[2:92]%>%
tolower()%>%
str_split(pattern="\\.")%>% #this splits the haplotypes
unlist()%>%
gsub(pattern="-",replacement="_")
#Full match loci
any(GTseq287%in%ADFG_Panel)
## [1] FALSE
#so lets look for partial matches using grep and the species and gene names
Matches<-lapply(1:91, function(x) c(ADFG_Panel[x],
grep(gsub(pattern="_[0-9]+$",
replacement="",
ADFG_Panel[x]),GTseq287,value=T)))
#Now lets pull all the matches in length greater than 1
Matches[which(lapply(FUN=length,X=Matches)>1)]
## [[1]]
## [1] "oke_u200_385" "oke_u2001_629" "oke_u2003_142" "oke_u2005_62"
##
## [[2]]
## [1] "oke_ahr1_78" "oke_ahr1_278"
##
## [[3]]
## [1] "oke_u1002_262" "oke_u1002_165"
##
## [[4]]
## [1] "oke_psmd9_57" "oke_psmd9_188" "oke_psmd9_057"
##
## [[5]]
## [1] "oke_cks1_94" "oke_cks1_70"
##
## [[6]]
## [1] "oke_u507_286" "oke_u507_087"
##
## [[7]]
## [1] "oke_cks_389" "oke_cks1_70"
Yikes, this looks bad. Only 2 loci look like true overlap between the GTSeq287 panel and the ADF&G baseline. What is going on? Based on what Wes had in his ‘chinook_chum_sockeye_snp.xlsx’ file there should be 63 matches! Lets do a quick manual search of the ‘GTseq_RAD_order_200521 (with coho, steelhead).xlsx’ file that has all the loci named. The first match is ‘Oke_ACOT-100’, so let’s see if that locus is in the file that Wes sent me. Strange it isn’t! So maybe he was working off a different file. Let’s look for the WDFW GT_seq350 SNP panel and see what is going on.
The [GT_seq350 panel] (https://www.psc.org/download/466/information/11005/s15-i09-s14-i17-chum-salmon-southern-area-genetic-baseline-enhancement-part-1-and-part-2-amplicon-development-expanded-baseline-collections-and-genotyping.pdf) is avalable in a PDF so we have to copy and paste into excel to get something to work with that must be what Wes was working off of! Let me pull in those loci and re-run the check for overlapping loci. There are only 287 loci that are in the file that he sent me. So 63 loci are missing… what are the chances that those are the overlapping loci?
GTseq350<-read_csv("./MasterData/WDFW_GTSeq350.csv",col_names=F)%>%
select(1)%>%
unlist()%>%
gsub(pattern="_F",replacement="",.)%>%
as.vector()%>%
tolower()
#Full match loci
any(GTseq350%in%ADFG_Panel)
## [1] TRUE
sum(GTseq350%in%ADFG_Panel)
## [1] 61
# so lets look for full and partial matches
# using grep and the species and gene names
Matches<-lapply(1:91, function(x) c(ADFG_Panel[x],
grep(gsub(pattern="_[0-9]+$",
replacement="",
ADFG_Panel[x]),GTseq350,value=T)))
#Now lets pull all the matches in length greater than 1
Matches<-Matches[which(lapply(FUN=length,X=Matches)>1)]
#And lets clean up the matches that are exact
Index<-which(lapply(FUN=length,X=Matches)>2)
NonExactMatch<-list()
x<-1
for (x in Index){
if(any(Matches[[x]][1]==Matches[[x]][-1])) {
Matches[[x]]<-c(Matches[[x]][1], Matches[[x]][which(Matches[[x]][1]==Matches[[x]][-1])+1])
} else {
NonExactMatch[[x]]<-Matches[[x]]
x<-x+1
}
}
Matches[which(lapply(FUN=length,X=Matches)>2)]
## [[1]]
## [1] "oke_psmd9_57" "oke_psmd9_057" "oke_psmd9_188"
PartMat<-which(lapply(FUN=length,X=Matches)>2)
Matches[[PartMat]]<-Matches[[PartMat]][-3]
MatchMat<-matrix(data=unlist(Matches),nrow=length(Matches),ncol=2,byrow=T)
colnames(MatchMat)<-c("ADFG","WDFW")
MatchMat<-MatchMat[-(grep(pattern="cks1|pgap|ahr1",x=MatchMat[,1])),]
rownames(MatchMat)<-paste("Locus",seq(1,nrow(MatchMat),1),sep="_")
ADFG_Loci2Use<-MatchMat[,1]
knitr::kable(MatchMat,"html")%>%
kableExtra::kable_styling()#%>%
| ADFG | WDFW | |
|---|---|---|
| Locus_1 | oke_ppa2_635 | oke_ppa2_635 |
| Locus_2 | oke_u200_385 | oke_u200_385 |
| Locus_3 | oke_rfc2_618 | oke_rfc2_618 |
| Locus_4 | oke_ras1_249 | oke_ras1_249 |
| Locus_5 | oke_u1017_52 | oke_u1017_52 |
| Locus_6 | oke_il_1racp_67 | oke_il_1racp_67 |
| Locus_7 | oke_serpin_140 | oke_serpin_140 |
| Locus_8 | oke_rh1op_245 | oke_rh1op_245 |
| Locus_9 | oke_tcp1_78 | oke_tcp1_78 |
| Locus_10 | oke_il8r2_406 | oke_il8r2_406 |
| Locus_11 | oke_mlrn_63 | oke_mlrn_63 |
| Locus_12 | oke_arf_319 | oke_arf_319 |
| Locus_13 | oke_u2057_80 | oke_u2057_80 |
| Locus_14 | oke_u2056_90 | oke_u2056_90 |
| Locus_15 | oke_u2054_58 | oke_u2054_58 |
| Locus_16 | oke_u2053_60 | oke_u2053_60 |
| Locus_17 | oke_u2048_91 | oke_u2048_91 |
| Locus_18 | oke_u2043_51 | oke_u2043_51 |
| Locus_19 | oke_u2041_84 | oke_u2041_84 |
| Locus_20 | oke_u2037_76 | oke_u2037_76 |
| Locus_21 | oke_u2035_54 | oke_u2035_54 |
| Locus_22 | oke_u2034_55 | oke_u2034_55 |
| Locus_23 | oke_u2032_74 | oke_u2032_74 |
| Locus_24 | oke_u2031_37 | oke_u2031_37 |
| Locus_25 | oke_u2029_79 | oke_u2029_79 |
| Locus_26 | oke_u2025_86 | oke_u2025_86 |
| Locus_27 | oke_u2011_107 | oke_u2011_107 |
| Locus_28 | oke_u2006_109 | oke_u2006_109 |
| Locus_29 | oke_u1024_113 | oke_u1024_113 |
| Locus_30 | oke_u1023_147 | oke_u1023_147 |
| Locus_31 | oke_u1015_255 | oke_u1015_255 |
| Locus_32 | oke_u1008_83 | oke_u1008_83 |
| Locus_33 | oke_u1002_262 | oke_u1002_262 |
| Locus_34 | oke_thic_84 | oke_thic_84 |
| Locus_35 | oke_sylc_90 | oke_sylc_90 |
| Locus_36 | oke_slc1a3a_86 | oke_slc1a3a_86 |
| Locus_37 | oke_rspry1_106 | oke_rspry1_106 |
| Locus_38 | oke_rab5a_117 | oke_rab5a_117 |
| Locus_39 | oke_psmd9_57 | oke_psmd9_057 |
| Locus_40 | oke_nc2b_148 | oke_nc2b_148 |
| Locus_41 | oke_mgll_49 | oke_mgll_49 |
| Locus_42 | oke_lamp2_186 | oke_lamp2_186 |
| Locus_43 | oke_glrx1_78 | oke_glrx1_78 |
| Locus_44 | oke_f5_71 | oke_f5_71 |
| Locus_45 | oke_eif4g1_43 | oke_eif4g1_43 |
| Locus_46 | oke_e2ig5_50 | oke_e2ig5_50 |
| Locus_47 | oke_dcxr_87 | oke_dcxr_87 |
| Locus_48 | oke_ccd16_77 | oke_ccd16_77 |
| Locus_49 | oke_catb_60 | oke_catb_60 |
| Locus_50 | oke_brp16_65 | oke_brp16_65 |
| Locus_51 | oke_brd2_118 | oke_brd2_118 |
| Locus_52 | oke_atp5l_105 | oke_atp5l_105 |
| Locus_53 | oke_acot_100 | oke_acot_100 |
| Locus_54 | oke_u507_286 | oke_u507_286 |
| Locus_55 | oke_u509_219 | oke_u509_219 |
| Locus_56 | oke_ctgf_105 | oke_ctgf_105 |
| Locus_57 | oke_u506_110 | oke_u506_110 |
| Locus_58 | oke_u504_228 | oke_u504_228 |
| Locus_59 | oke_gpdh_191 | oke_gpdh_191 |
| Locus_60 | oke_hp_182 | oke_hp_182 |
| Locus_61 | oke_cks_389 | oke_cks_389 |
| Locus_62 | oke_u1021_102 | oke_u1021_102 |
#kableExtra::row_spec(7, bold = T, color = "black", background = "#D7261E")
So looking down the list we see that we have locus pgap-111 in the ADF&G baseline and pgap-92 in the WDFW baseline. In the ADF&G report they state, ‘These SNPs were combined into a phenotypic locus, Oke_pgap-111-92, in preliminary investigations of linkage. However, Oke_pgap-111 was retained and Oke_pgap-92 dropped after the linkage was not shown to be useful in proof tests using the complete baseline.’ It was shown to be linked in 80% of the populations.
Loci oke_ahr1_78 and oke_ahr1_278 are similarly named, but they are not the same locus. ADF&G has both listed, but only 78 is included in the final marker panel. WDFW used oke_ahr1_278.
So we also had a partial match on oke_cks1_94 and oke_cks1_70. It turns out that ADF&G had both loci included in their report, but only 94 was included in the panel. WDFW did the reverse in their panel, they retained oke_cks1_70 and dropped oke_cks1_94. We can remove these.
Wes had loci Oke_CD123_62_F and Oke_CD81-108 as matching. I think this is likely just an error from doing it so quickly. He also had Oke_KPNA2_87_F included by it is not in the final 350GTseq panel for WDFW, but it is used in the ADF&G baseline.
I had three loci that matched between the ADF&G and WDFW baseline that Wes didn’t have (oke_glrx1_78, oke_mgll_49, and oke_u200_385).
Now I need to create an input file with just the 62 loci that overlap
ADFG_BaselineSNPsLoci<-read_lines("./MasterData/WASSIP310Pops91loci.gen")[2:92]%>%
tolower()%>%
str_split(pattern="\\.")%>% #this splits the haplotypes
unlist()%>%
gsub(pattern="-",replacement="_")
AlleleIndex<-which(ADFG_BaselineSNPsLoci%in%ADFG_Loci2Use == T)[-62]
# Huge note!!!! Locus#90 oke_u1021_102 is a micro-haplotype from two loci.
# It is actually scored as an integer for the haplotype
# We will probably want to drop it later if we can't figure out how to
# get rid of the associated locus
ADFG_BaselineSNPsLoci[AlleleIndex]
ADFG_BaselineSNPs<-read_lines("./MasterData/WASSIP310Pops91loci.gen")[-c(1:92)]
PopIndex<-paste(grep("Pop|POP",ADFG_BaselineSNPs)+1,":",c(((grep("Pop|POP",
ADFG_BaselineSNPs)[-1])-1),length(ADFG_BaselineSNPs)),sep="")
#So now we want to write out a nicely formatted genepop file
outfile<-"ADFG_62SNPBaseline.gen"
write.table("ADFG_ReducedSNPbaseline",outfile,append=F,quote=F,row.names = F,
col.names = F)
#I am going to write the loci names that ADF&G uses
write.table(read_lines("./MasterData/WASSIP310Pops91loci.gen")[2:92][AlleleIndex],
outfile,append=T,quote=F,row.names = F,col.names = F)
for (P in 1:length(PopIndex)){
write.table("Pop",
outfile,append=T,quote=F,row.names = F,col.names = F)
PopTemp<-ADFG_BaselineSNPs[(eval(parse(text=PopIndex[P])))]%>%
str_split(pattern= "\\s+")%>%
lapply(.,"[",c(1,2,(AlleleIndex+2)))%>%
lapply(.,paste,collapse = " ")%>%
unlist()#%>%
#paste(.,"\n",sep="")
write.table(PopTemp,
outfile,append=T,quote=F,row.names = F,col.names = F)
}
Let’s take a look at the breakdown of the reporting groups and the populations within each reporting group.
For the reporting groups the size of the GOA/PNW greatly exceeds that of the other groups. And Southwest AK has the fewest number of collections. Let’s take a closer look at how man indiviuals represent each collection.
PopBase<-read.csv("./MasterData/ADFG_ChumSNP_PopNames.csv")[1:310,1:5]
RepGroups_ChumSNPs<-PopBase%>%
group_by(RepGroup)%>%
summarise(nPops = length(unique(Location)))%>%
.[c(1,9,5,6,7,8,3,2,4),]
knitr::kable(RepGroups_ChumSNPs)
| RepGroup | nPops |
|---|---|
| Asia | 35 |
| UpperYukon | 24 |
| KotzebueSound | 8 |
| NorthernDistrict | 11 |
| NorthwestDistrict | 7 |
| SouthPenninsula | 14 |
| CWAK | 64 |
| Chignik/Kodiak | 35 |
| EastofKodiak | 110 |
So again we have a lot of populations that are grouped into the East of Kodiak reporting group. There are also very few populations in the Kotzebue Sound, Northern District, Northwet District, and South Penninsula reporting groups.
We are going to evaluate the baseline with these reporting groups.
Again I am not too concerned with issues of disequilibrium becuase these loci have been so well vetted.
We are going to follow the same procedure as above with the microsatellites to compute Nei’s \(F_{ST}\) with the package hierfstat. Again, eval for this bit of the code is turned off so we don’t have to wait for the bootstrapping which can take a long time to perform with so many populations.
Chum_SNPsBase_genid<-adegenet::read.genepop("./ADFG_62SNPBaseline.gen",ncode=2)
Chum_SNPsBase_hier <- hierfstat::genind2hierfstat(Chum_SNPsBase_genid)
ChumSNPs_Fst<-hierfstat::pairwise.neifst(Chum_SNPsBase_hier)
ChumSNPs_Fst_bs <- hierfstat::boot.ppfst(dat=Chum_SNPsBase_hier,nboot=100,quant=c(0.025,0.975),diploid=TRUE)
diag(ChumSNPs_Fst)<-0
NJ.chumSNPs<-phangorn::NJ(ChumSNPs_Fst)
#NJ.chumSNPs$tip.label<-PopBase$Location
p3<-ggtree(NJ.chumSNPs)
Temp<-PopBase[,c(3,1)]
p_ChumSNPs<-p3 %<+% Temp +
geom_tiplab(aes(fill = factor(RepGroup)),
size = 1.75,
color = "black", # color for label font
geom = "label", # labels not text
label.padding = unit(0.15, "lines"), # amount of padding around the labels
label.size = 0.01) +
theme(legend.position = c(0.823,0.8),
legend.title = element_blank(), # no title
legend.key = element_blank())
p_ChumSNPs
The South Penninsula group is spread over multiple parts of the tree, but those collections do group with other Southwest AK collections. There is an interesting group of samples in the middle of the tree that we might want to take a closer look at.
#p_ChumSNPs + geom_text2(aes(subset=!isTip, label=node))
viewClade(p_ChumSNPs, node=378)
Those are the Susitna River collections.
From the NJ tree PC1 should separate East of Kodiak from all other samples and PC2 might separate the Asia reproting group from Alaska.
X_ChumSNPs <- scaleGen(Chum_SNPsBase_genid,NA.method="mean",center=T,scale=T)# can set the missing to NA, 0, or mean
pca_ChumSNPs <- dudi.pca(X_ChumSNPs,cent=F,scale=F,scannf=FALSE,nf=3) #nf is kept axes, note you can center and scale in this step
barplot(pca_ChumSNPs$eig[1:15],main="PCA eigenvalues", col=heat.colors(15))
PopRGmap<-data.frame(Pop = as.character(PopBase[,3]),
RepGroup = as.character(PopBase[,1]),
stringsAsFactors = F)
pca_ChumSNPsdf<-data.frame(Ind = dimnames(Chum_SNPsBase_genid@tab)[[1]],
Pop = pop(Chum_SNPsBase_genid),
RepGroup = plyr::mapvalues(
x=pop(Chum_SNPsBase_genid), from = PopRGmap$Pop,
to = PopRGmap$RepGroup),
pca_ChumSNPs$li,
stringsAsFactors = F)
pca_ChumSNP1 <- ggplot(pca_ChumSNPsdf) +
geom_point( aes(Axis1, Axis2,color=as.factor(RepGroup)),size=2,alpha = 1/2)+
theme(axis.line = element_blank(),
panel.background = element_blank()) +
geom_vline(xintercept = 0)+
geom_hline(yintercept = 0)+
scale_color_discrete(name = "Reporting Group") +
theme(legend.position = "none")
pca_ChumSNP2 <- ggplot(pca_ChumSNPsdf) +
geom_point( aes(Axis2, Axis3,color=as.factor(RepGroup)),size=2,alpha = 1/2)+
theme(axis.line = element_blank(),
panel.background = element_blank()) +
geom_vline(xintercept = 0)+
geom_hline(yintercept = 0)+
scale_color_discrete(name = "Reporting Group")
cowplot::plot_grid(pca_ChumSNP1,pca_ChumSNP2)
PCA_Load_ChumSNPs<-data.frame(pca_ChumSNPs$c1[order(pca_ChumSNPs$c1[,1]^2,decreasing=T),],
pca_ChumSNPs$c1[order(pca_ChumSNPs$c1[,1]^2,decreasing=T),]^2,
cumsum(pca_ChumSNPs$c1[order(pca_ChumSNPs$c1[,1]^2,decreasing=T),]^2))
head(PCA_Load_ChumSNPs,10)%>%
select(Loadings=CS1,Variance=CS1.1,CumSum=CS1.2)%>%
knitr::kable()
| Loadings | Variance | CumSum | |
|---|---|---|---|
| Oke_RFC2-618.02 | 0.2368394 | 0.0560929 | 0.0560929 |
| Oke_RFC2-618.01 | -0.2368394 | 0.0560929 | 0.1121858 |
| Oke_CATB-60.01 | 0.1963107 | 0.0385379 | 0.1507237 |
| Oke_CATB-60.02 | -0.1963107 | 0.0385379 | 0.1892616 |
| Oke_DCXR-87.02 | 0.1716159 | 0.0294520 | 0.2187136 |
| Oke_DCXR-87.01 | -0.1716159 | 0.0294520 | 0.2481656 |
| Oke_rab5a-117.02 | -0.1652246 | 0.0272992 | 0.2754648 |
| Oke_rab5a-117.01 | 0.1652246 | 0.0272992 | 0.3027640 |
| Oke_PPA2-635.02 | -0.1543188 | 0.0238143 | 0.3265783 |
| Oke_PPA2-635.01 | 0.1543188 | 0.0238143 | 0.3503926 |
qplot(y=PCA_Load_ChumSNPs$CS1.2)+
labs(y = "Variance Explained",x = "SNP allele")+
geom_hline(yintercept = 0.9,lty=2,col="red")
PC1 separates Coastal Western AK and the Yukon from all other collections. It appears that the second PC splits out some of the East of Kodiak collections, but I am not going to dig into this right now. The third PC pulls out the Asia collections. Locus RFC2-618 explains quite a bit of variance in the first PC. It was identified as an outlier locus in Seeb et al. (Seeb et al. 2011). From that paper it looks like locus U502-241 might be similarly effective, but it is not included in this panel.
For the GSI evaluations I need to convert the genepop file to rubias.
source("./functions/Genepop2rubias.R")
####Genepop2rubias
infile="./ADFG_62SNPBaseline.gen"
outfile = "ADFG_62SNPBaselineRubias.csv"
ReportingGroupFile="./ChumSNPs_RepGroups.csv"
Genepop2rubias_baseline(infile=infile,
outfile=outfile,
digits=2,
ReportingGroupFile=ReportingGroupFile)
write.csv(file="ChumSNPs_RepGroups.csv",x=PopRepGroups,row.names = F,quote = F)
ChumSNPsBase<-read.csv(file.path(outfile))
ChumSNPsBase[,1:4]<-lapply(1:4,function(x) as.character(ChumSNPsBase[,x]))
CollectionInfo_ChumSNPs<-ChumSNPsBase%>%
group_by(repunit,collection)%>%
summarise(nInd = length(unique(indiv)))
sa_ChumSNPs <- self_assign(reference = ChumSNPsBase, gen_start_col = 5)
## Summary Statistics:
##
## 32817 Individuals in Sample
##
## 61 Loci: Oke_ACOT.100, Oke_arf.319, Oke_ATP5L.105, Oke_brd2.118, Oke_brp16.65, Oke_CATB.60, Oke_ccd16.77, Oke_CKS.389, Oke_ctgf.105, Oke_DCXR.87, Oke_e2ig5.50, Oke_eif4g1.43, Oke_f5.71, Oke_glrx1.78, Oke_GPDH.191, Oke_HP.182, Oke_il.1racp.67, Oke_IL8r2.406, Oke_LAMP2.186, Oke_mgll.49, Oke_MLRN.63, Oke_nc2b.148, Oke_PPA2.635, Oke_psmd9.57, Oke_rab5a.117, Oke_ras1.249, Oke_RFC2.618, Oke_RH1op.245, Oke_RSPRY1.106, Oke_serpin.140, Oke_slc1a3a.86, Oke_sylc.90, Oke_TCP1.78, Oke_thic.84, Oke_U1002.262, Oke_U1008.83, Oke_U1015.255, Oke_U1017.52, Oke_U1023.147, Oke_U1024.113, Oke_u200.385, Oke_U2006.109, Oke_U2011.107, Oke_U2025.86, Oke_U2029.79, Oke_U2031.37, Oke_U2032.74, Oke_U2034.55, Oke_U2035.54, Oke_U2037.76, Oke_U2041.84, Oke_U2043.51, Oke_U2048.91, Oke_U2053.60, Oke_U2054.58, Oke_U2056.90, Oke_U2057.80, Oke_U504.228, Oke_U506.110, Oke_U507.286, Oke_U509.219
##
## 9 Reporting Units: Asia, KotzebueSound, CWAK, UpperYukon, NorthernDistrict, NorthwestDistrict, SouthPenninsula, Chignik/Kodiak, EastofKodiak
##
## 310 Collections: CMNAMD05F, CMTOKA02, CMKUSH98, CMNISH97, CMABAS98, CMSHIN02, CMTESH01, CMYURA97E, CMYURA97L, CMCHIT03E, CMSASA90, CMSHAR01, CMTOKOR05, CMTSHIB04, CMGAKK03E, CMNAIB95, CMTYM95, CMBOL97, CMPARA98, CMAMU97, CMBIST98, CMHAIR93, CMOZER98, CMPYMT93B, CMPENZ93, CMKOL90B, CMVORO93, CMKAM90, CMPALA98UW, CMMAG91, CMOSS96, CMTAUY90, CMOLA90, CMOKLA93, CMKANC91, CMUDAR94, CMINMA05UW, CMKIAN04UW, CMKOB91, CMKEL91, CMNOA91, CMSEL94, CMAGIA05, CMAMER05, CMELDO05, CMNOME05, CMPIL94, CMSNA95, CMSOL96, CMFISH04, CMKWIN04UW, CMNIUK04, CMTUBU09, CMSHAK05, CMPIKM05, CMKOYU05, CMUNAL04UW, CMUNGA10, CMBLAC06, CMAND93W, CMANDR04EUW, CMCHUL89, CMANV93D, CMANV92B, CMYUKA93, CMKALT92, CMNUL94, CMGIS04, CMCLE02, CMHUS93, CMSFKO96E, CMHENS04, CMMELO03, CMTOZI03, CMKWE94, CMTUL07, CMKIS94, CMANI92, CMSALM07UW, CMHOL95, CMKOG07, CMKAS94, CMGEO07, CMSTO94L, CMNECO07, CMTATL07, CMNUN94, CMTAK07, CMBIGR08, CMSFKU08, CMWINDF08, CMKAN07UW, CMGOOD06NF, CMMEKO06C, CMTOGRM11, CMOSVIAK10UW, CMSUNS06, CMIOW10UW, CMMUL94, CMKOKW11, CMNUS93, CMSTUY93, CMKLUTU10, CMBRIB93, CMALAG10UW, CMWHLMT10UW, CMPUMIC10, CMWAND10, CMHENS95, CMSFKO96L, CMJIM10, CM17MI10F, CMTAN93, CMTOKA94, CMKAN01, CMCHE94, CMSAL01, CMDEL94, CMBLU92, CMBSAL01, CMCHAN01, CMSHE92, CMBLAC95, CMOLDC07, CMFBR07UW, CMKLUA07, CMPEL93, CMMINTS89, CMTATN92, CMBCK95, CMTES92, CMDON94, CMWIGG09UW, CMMES92UW, CMPLEN09UW, CMMESB09UW, CMILNIK02UW, CMNCSEN10UW, CMRHMB09UW, CMLAW92UW, CMCOAL08UW, CMDEERV08UW, CMNELSR08UW, CMMOF96UW, CMJOS09UW, CMFRO09UW, CMALL96UW, CMTRA92UW, CMSTC09UW, CHPET92UW, CMLIJ09UW, CMSANC96UW, CMRUS93UW, CMDEL96UW, CMBEL92UW, CMVOL09UW, CMRUB96UW, CMCAN92UW, CMZAC92UW, CHBAL92UW, CMCOL96UW, CMCHI96UW, CMSTE09UW, CMSTEPR09UW, CMIVAN09UW, CMWESJ93UW, CMWESK93UW, CMWESE93UW, CMAMBM09UW, CMNECR08UW, CMOCEB09UW, CMNAKIL08UW, CMCHIGK09UW, CMKIAL09UW, CMPASS09UW, CMDRYBR09UW, CMWESD93UW, CMWESG93UW, CMBIGRI09UW, CMKARL09UW, CMSTU09UW, CMBSU92UW, CMDEAD09UW, CMSITKI09UW, CMPORTNE09UW, CMBARL09UW, CMWKIL09UW, CHDOG92UW, CMCOXC09UW, CMGULLC09UW, CMEAGLH09UW, CMROUGH09UW, CHAME92UW, CMRUSSI07UW, CHKIZ92UW, CHUGA92UW, CMSPIRU09UW, CMZACR09UW, CMKITB09UW, CMMcN96, CMCHU93, CMSUS96, CMTALK95, CMSPINK08, CMLSUS10UW, CMWILL10, CMCARMLK10, CMWILLIW10, CMBEART10, CMCONS10, CMSIWA10, CMWHN09, CMWEL10, 5P92EKEE, CMPWS95A, CMCHILK06, CMHERM06, CMKLEH06, CMWELLB06, CMSAWM06S, CMFCJUN06S, CMTAKU06F, CMPROS10, CMSANB06S, CMADMI10, CMSWANC10, CMLONG92, CMSALTE92, CMFORD06F, CMSIST06, CMHFHAT06, CMRALPH06, CMSAOB06, CMMEDV09, CMNAKWA06, CMWCRA06, CMKLAH86, CMHARDR10, CMDISA98, CMDISA10F, CMKARTA06, CMLAGO10F, CMDRYB06S, CMNARM87, CMSAGI10, CMSAMPL10, CMCARR09, CMNEET06S, CMNEET06F, CMTRACR06, CMFCHYD06, CMFISH88L, CMHIDIN09, CMNAKA06S, CMSTAG88, CMECST88, CMKITW06UW, CMBAGH89, CMPALL06, CMSALM89, CMSEDG89, CMSUPR89, CMKITA06, CMKITIM06, CMSNOOT06, CMWARC06, CMBIGQU06, CMGOLD06, CMLQUA06, CMPUNTL06, CMCONU06, CMNAHM06, CMNITI06, CMSARI06, CMSOOK06, CMSUGS06, CMNIMP06, CMALOU06, CMINCH06, CMNORR06, CMWEAV06, CMSALM04S, CMLSKA98F, CMSKYK94F, CMSNOQ96, CMUSAU94F, CMELWH04UW, CMDIRU02, CMMILCR94, CMSHER94F, CMSHER94S, CMBMI03F, CMDEW98F, CMDOS03S, CMHAMM05, CMHAM03S, CMLIL06F, CMLIL01S, CMQUIL97, CMUNI04S, CMI205B00F, CMJIM01S, CMJOH94S, CMNISQ04UW, CMKAL03W, CMNCR98F, CMGRA01F, CMLCR94F, CMSATS98, CMSKA02F
##
## 0.55% of allelic data identified as missing
sa_to_repu_ChumSNPs <- sa_ChumSNPs %>%
group_by(indiv, collection, repunit, inferred_repunit) %>%
summarise(repu_scaled_like = sum(scaled_likelihood))
RepGroups <- unique(PopRGmap$RepGroup)
sa_mean_scaled_like_ChumSNPs <- sa_to_repu_ChumSNPs %>%
mutate(repunit_f = factor(x = repunit, levels = RepGroups)) %>% #turn character into factor
mutate(inferred_repunit_f = factor(x = inferred_repunit, levels = RepGroups)) %>% #turn character into factor
group_by(repunit_f, inferred_repunit_f) %>%
summarise(mean_repu_scaled_like = mean(repu_scaled_like)) #take the mean of the scaled liklihoods
#Generate a heatmap
ggplot(sa_mean_scaled_like_ChumSNPs,aes(x = repunit_f, y = inferred_repunit_f, z = mean_repu_scaled_like)) +
geom_tile(aes(fill = mean_repu_scaled_like)) +
geom_text(aes(label=sprintf(unlist(sa_mean_scaled_like_ChumSNPs[,3]),fmt='%#.2f')))+
scale_fill_gradient(low = "white", high = "red", limits = c(0, 1), name = "Mean Scaled Likelihood\n") +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Reporting Unit") +
ylab("Inferred Reporting Unit")
It looks like grouping Kotzebue Sound into Coastal Western Alaska and grouping the Northern District, Northwestern District, South Penninsula, and Chignik/Kodiak collections together into a SWAK group like with the microsatellites.
Proof100_scenarios_ChumSNPs <- lapply(RepGroups, function(x) tibble(repunit = x, ppn = 1.0))
names(Proof100_scenarios_ChumSNPs) <- paste("All", RepGroups, sep = "-")
Proof100_results_ChumSNPs <- assess_reference_loo(reference = ChumSNPsBase,
gen_start_col = 5,
reps = 10,
mixsize = 100,
alpha_repunit = SNPProof100_scenarios_ChumSNPs,
alpha_collection = 10) #distribution of individuals from collections ~equal
Proof100_results_ChumSNPs %>%
mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>%
group_by(repunit_scenario, iter, repunit_f) %>%
summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n)) %>%
filter(gsub(pattern="All-","",repunit_scenario) == repunit_f) %>%
ggplot(aes(x = iter, y = repprop_posterior_mean, fill = repunit_f)) +
geom_bar(stat = 'identity') +
geom_hline(yintercept = 0.9) +
scale_fill_discrete(name = "Reporting Group") +
facet_wrap(~ repunit_f) +
xlab("Iteration") +
ylab("Posterior Mean Reporting Group Proportion")
The South Penninsula group looks pretty darn bad. We can look at just those 100% mixtures to see which reporting group is being overestimated.
SNPProof100_results_ChumSNPs%>%
filter(repunit_scenario == "All-SouthPenninsula") %>%
group_by(iter, repunit)%>%
summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n)) %>%
mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>%
ggplot(aes(x = repunit, y = repprop_posterior_mean, fill = repunit_f)) +
geom_boxplot() +
geom_hline(yintercept = 0.9,lty=2,color="red") +
scale_fill_discrete(name = "Reporting Group") +
xlab("Iteration") +
ylab("Posterior Mean Reporting Group Proportion") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
When the mixture is composed entirely of fish from the South Penn reporting group we overestimate the contribution from Chignik/Kodiak.
We don’t have any realistic fishery proportions for each of the new reporting groups so I am just going to simulate a bunch of differnet proportions from the Dirichlet. I should be able to simulate proportions for each reporting group that span 0-1.
#First create some fishery proportions
DirichletProps<-as.data.frame(rbind(MCMCpack::rdirichlet(15,c(rep(0.01,9))),#somefixed
MCMCpack::rdirichlet(15,c(rep(0.1,9))),
MCMCpack::rdirichlet(15,c(rep(1,9))),
MCMCpack::rdirichlet(15,c(rep(10,9)))))
colnames(DirichletProps)<-RepGroups
DirichletProps$Analysis<-1:nrow(DirichletProps)
RFpropsDir<-reshape2::melt(DirichletProps, id=c("Analysis"))
#Make a list of scenarios
RF_scenarios_ChumSNPs <- lapply(unique(RFpropsDir$Analysis), function(x)
RFpropsDir %>%
filter(Analysis == x) %>%
ungroup()%>%
select(-1)%>%
setNames(.,c("repunit", "ppn" ))
)
names(RF_scenarios_ChumSNPs)<-paste("RF-",seq(1,nrow(DirichletProps)),sep="")
unique(CollectionInfo_ChumSNPs$repunit)[c(1,5,3,9,6,7,8,2,4)]
## [1] "Asia" "KotzebueSound" "CWAK"
## [4] "UpperYukon" "NorthernDistrict" "NorthwestDistrict"
## [7] "SouthPenninsula" "Chignik/Kodiak" "EastofKodiak"
RF_results_ChumSNPs <- assess_reference_loo(reference = ChumSNPsBase,
gen_start_col = 5,
reps = 10,
mixsize = 100,
alpha_repunit = RF_scenarios_ChumSNPs,
alpha_collection = 10, #distribution of individuals from collections ~equal
return_indiv_posteriors = F)
#for some reason I am getting NAs in the repunits
RFsum_ChumSNPs <- RF_results_ChumSNPs %>%
mutate(repunit_f = factor(x = repunit, levels = RepGroups[c(1,5,3,9,6,7,8,2,4)]))%>%
group_by(repunit_scenario, iter, repunit_f) %>%
summarise(true_repprop = sum(true_pi), reprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n))
ggplot(RFsum_ChumSNPs, aes(x = repu_n_prop, y = reprop_posterior_mean, colour = repunit_f)) +
geom_point() +
scale_color_discrete(name = "Reporting Group")+
geom_abline(intercept = 0, slope = 1)+
geom_abline(intercept = 0.1, slope = 1,lty=2)+
geom_abline(intercept = -0.1, slope = 1,lty=2)+
facet_wrap(~ repunit_f)+
ylab("Posterior Mean Reporting Group Proportion") +
xlab("Simulated Mixture Proportion")
Well these reporting groups don’t look like they are going to work very well with a reduced baseline.
Let’s see what happens when we group these baselines similar to the microsatellite reporting groups. I say similar becuase I am just going to add Kotzebue Sound to the CWAK and then group the four reporting groups on the penninsula and Kodiak togehter to make a SWAK group. We can also break apart the Asia reporting group.
#make a new tibble so we don't mess up the first
ChumSNPsBaseNRG<-ChumSNPsBase
#read in the poprep groups for
#PoprepGroups<-read.csv("./ChumSNPs_RepGroups.csv",stringsAsFactors = F)
#PoprepGroups$ADFGnames<-unique(ChumSNPsBaseNRG[,3])
ChumSNPsBaseNRG[,2]<-plyr::mapvalues(from=PopBase[,3],
to=as.character(PopBase[,5]),
x=ChumSNPsBaseNRG[,3])
RepGroupsNew<-unique(ChumSNPsBaseNRG$repunit)
RepGroupsOld<-unique(ChumUsatBase$repunit)
RepGroupMap<-cbind(RepGroupsNew,
RepGroupsOld
)
#sub in the new names
Chum2017[,2]<-plyr::mapvalues(from=RepGroupMap[,2],
to=RepGroupMap[,1],
x=unlist(Chum2017[,2]))
RF_scenarios <- lapply(unique(Chum2017$Analysis), function(x)
Chum2017 %>%
filter(Analysis == x) %>%
ungroup()%>%
select(-1)%>%
setNames(.,c("repunit", "ppn" ))
)
names(RF_scenarios) <- ChumAnalyses %>%
filter(year ==2017) %>%
mutate(Analysis = paste(lme,spatialstrat,temporalstrat,sep="_")) %>%
select(Analysis) %>%
unique()%>%
unlist()
RF_results_ChumSNPs_NRG <- assess_reference_loo(reference = ChumSNPsBaseNRG,
gen_start_col = 5,
reps = 10,
mixsize = 100,
alpha_repunit = RF_scenarios,
alpha_collection = 10) #distribution of individuals from collections ~equal
RFsum_ChumSNPs_NRG <- RF_results_ChumSNPs_NRG %>%
mutate(repunit_f = factor(x = repunit, levels = RepGroupsNew))%>%
group_by(repunit_scenario, iter, repunit_f) %>%
summarise(true_repprop = sum(true_pi), reprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n))
ggplot(RFsum_ChumSNPs_NRG, aes(x = repu_n_prop, y = reprop_posterior_mean, colour = repunit_f)) +
geom_point() +
scale_color_discrete(name = "Reporting Group")+
geom_abline(intercept = 0, slope = 1)+
geom_abline(intercept = 0.1, slope = 1,lty=2)+
geom_abline(intercept = -0.1, slope = 1,lty=2)+
facet_wrap(~ repunit_f)+
ylab("Posterior Mean Reporting Group Proportion") +
xlab("Simulated Mixture Proportion")
That looks great! It does as well or better than the microsatellite baseline.
The SNP baseline contains genetic information (43 SNP DNA markers) for 23,205 individuals from 172 populations grouped into 11 geographic regions (Templin et al. 2011).
Let’s take a look at the baseline that NOAA has been using.
ChinookSNPBase<-read.csv("./MasterData/ChinookBaseline_rubias.csv",stringsAsFactors = F)
RepGroups_Chinook<-ChinookSNPBase%>%
group_by(repunit)%>%
summarise(nPops = length(unique(collection)),
nInd = length(unique(indiv)))
knitr::kable(RepGroups_Chinook[c(9,3,5,10,6,8,4,7,2,1,11),])
| repunit | nPops | nInd |
|---|---|---|
| Russia | 4 | 340 |
| CoastWAK | 29 | 4520 |
| MidYukon | 8 | 1097 |
| UpYukon | 13 | 1734 |
| NAKPen | 6 | 479 |
| NWGOA | 19 | 3212 |
| Copper | 11 | 1181 |
| NEGOA | 7 | 987 |
| CoastSEAK | 25 | 3739 |
| BC | 36 | 4368 |
| WestCoastUS | 14 | 1548 |
Russia has very few stocks, but at the same time Russia contributes very few fish to the mixtures. All the other reporting groups have a fair number of collections. The Northern AK penninsula has only 6 collections, and Wes has had issues with convergence for this reporting group in the past (Larson et al. 2013), so let’s keep an eye on it.
CollectionInfo_Chinook<-ChinookSNPBase%>%
group_by(repunit,collection)%>%
summarise(nInd = length(unique(indiv)))
#knitr::kable(CollectionInfo)
Within the collections the average size is 134.91 (min = 42 and max = 397).
Again we will breeze by this step.
We have a cleaned up baseline file in genepop format from some other analyses I was doing. Adegenet doesn’t like haplotype data, so we need to first remove the last locus. Then reimport the file. I created a function awhile back to drop a locus from a genepop file. Let’s source that function and use it to get rid of locus C3N3. We will use it in the baseline evaluations, but drop it for the \(F_{ST}\) NJ tree and PCA plots.
source("./functions/RmGenepopLocus.R")
LocusRM(input="./MasterData/ChinookBaseline_clean.gen",locus2rm='C3N3',NewFileName='ChinookBaseline_42l.gen',digits=2,PopNameAsPrefix=T)
Next we
pulled it into R with adgenet’s read genepop function and then calculate some Nei’s \(F_{st}\) values. In the markdown document we won’t evaluate this piece everytime becuase the bootstrapping over 300 populations can take some time. Instead we will read in the results at the top of the document.
Chinook_SNPBase.genid<-adegenet::read.genepop("./ChinookBaseline_42l.gen",ncode=2)
ChinookSNPBase_hierf <- hierfstat::genind2hierfstat(Chinook_SNPBase.genid)
Chinook_SNPFst.Nei<-hierfstat::pairwise.neifst(ChinookSNPBase_hierf)
Chinook_SNPFst.WC<-hierfstat::pairwise.WCfst(ChinookSNPBase_hierf)
Chinook_SNPFst_bs <- hierfstat::boot.ppfst(dat=ChinookSNPBase_gdf,nboot=100,quant=c(0.025,0.975),diploid=TRUE)
diag(Chinook_SNPFst.Nei)<-0
A NJ tree was constructed in the package phangorn and visualized with the package ggtree.
NJ.ChinookSNP.Nei<-phangorn::NJ(Chinook_SNPFst.Nei)
NJ.ChinookSNP.Nei$tip.label<-unique(gsub(pattern="\\.[0-9]+$",replacement="",Chinook_SNPBase.genid@pop))
PopRepGroups_ChinookSNP<-ChinookSNPBase%>%
group_by(repunit,collection)%>%
slice(1)%>%
select(Pop = collection ,RepGroup = repunit)%>%
ungroup()%>%
mutate(RepGroup_f = factor(x = RepGroup, levels = levels(as.factor(unique(ChinookSNPBase$repunit)))[c(9,3,5,10,6,8,4,7,2,1,11)]))
#p_ChinookSNP + geom_text2(aes(subset=!isTip, label=node))
#viewClade(pNei_ChinookSNP, node=218)
pNei_ChinookSNP<-ggtree(NJ.ChinookSNP.Nei) %<+% PopRepGroups_ChinookSNP +
geom_tiplab(aes(fill = RepGroup_f),
size = 1.75,
color = "black", # color for label font
geom = "label", # labels not text
label.padding = unit(0.15, "lines"), # amount of padding around the labels
label.size = 0.01) +
theme(legend.position = c(0.8,0.2),
legend.title = element_blank(), # no title
legend.key = element_blank())
pNei_ChinookSNP
This is an unrooted tree, so all the short branch lengths for coastal WAK look fine. The Mid and Upper Yukon cluster together well. British Columnbia and Coastal Southeast AK are kind of a mess, but that isn’t suprising. For the most part this looks better than expected with 11 reporting groups, but at the same time they have been using this SNP panel for awhile.
X_Chinook <- scaleGen(Chinook_SNPBase.genid,NA.method="mean",center=T,scale=T)# can set the missing to NA, 0, or mean
pca_Chinook <- dudi.pca(X_Chinook,cent=F,scale=F,scannf=FALSE) #nf is kept axes, note you can center and scale in this step
barplot(pca_Chinook$eig[1:15],main="PCA eigenvalues", col=heat.colors(15))
PopRGmap_Chinook<-data.frame(Pop = as.character(unique(pop(Chinook_SNPBase.genid))),
RepGroup = as.character(sapply(1:length(unique(pop(Chinook_SNPBase.genid))),function(x)
ChinookSNPBase[grep(unique(pop(Chinook_SNPBase.genid))[x],
x=ChinookSNPBase$indiv),2])),
stringsAsFactors = F)
PCAdf_Chinook<-data.frame(Ind = dimnames(Chinook_SNPBase.genid@tab)[[1]],
Pop = pop(Chinook_SNPBase.genid),
RepGroup = plyr::mapvalues(
x=pop(Chinook_SNPBase.genid), from = PopRGmap_Chinook$Pop,
to = PopRGmap_Chinook$RepGroup),
pca_Chinook$li,
stringsAsFactors = F)
PCAdf_Chinook$Pop <- gsub(PCAdf_Chinook$Pop, pattern="_[0-9]+$", replacement = "" )
ggplot(PCAdf_Chinook) +
geom_point( aes(Axis1, Axis2,color=as.factor(RepGroup)),size=2,alpha = 1/2)+
theme(axis.line = element_blank(),
panel.background = element_blank()) +
geom_vline(xintercept = 0)+
geom_hline(yintercept = 0)+
scale_color_discrete(name = "Reporting Group")
pca.Chinook_Loads<-data.frame(Loci=row.names(pca_Chinook$c1[order(pca_Chinook$c1[,1]^2,decreasing=T),]),
pca_Chinook$c1[order(pca_Chinook$c1[,1]^2,decreasing=T),],
pca_Chinook$c1[order(pca_Chinook$c1[,1]^2,decreasing=T),]^2,
cumsum(pca_Chinook$c1[order(pca_Chinook$c1[,1]^2,decreasing=T),]^2))
Loads<-head(pca.Chinook_Loads,10)%>%
select(Loadings=CS1,Variance=CS1.1,CumSum=CS1.2)
#pca.Chinook_Loads[grep(pattern="MHC",x=pca.Chinook_Loads[,1]),]
It looks like the major division is just past Coastal SEAK. The second PC then distinguishes the upper and middle Yukon.
sa_ChinookSNPs <- self_assign(reference = ChinookSNPBase, gen_start_col = 5)
## Summary Statistics:
##
## 23205 Individuals in Sample
##
## 43 Loci: ARF, AsnRS72, C3N3, ETIF1A, FARSLA220, FGF6A, GH2, GPDH, GPH318, GST207, GST375, GTH2B550, HGFA, hnRNPL533, HSP90B100, IGF191, IK1328, IL1RA, LEI292, MHC1, MHC2, NOD1, NRP, OPLW173, OPSW152, P450, Prl2, PrpI120, RAG3, RFC2, S71, SClkF2, SERPC1209, SL, TAPBP, Tnsf, U200167, U211, U212297, UNKN4150, UNKN6187, xKER137, zP3b
##
## 11 Reporting Units: Russia, CoastWAK, MidYukon, UpYukon, NAKPen, NWGOA, Copper, NEGOA, CoastSEAK, BC, WestCoastUS
##
## 172 Collections: KBIST98L, KBOLS0298E, KKAMC98L97, KPAKH02, KPILG0506, KUNAL05, KGOLS0506, KANDR0203, KANVI02, KGISA01, KTOZI0203, KHENS01, KSFKOY03, KBEAV97, KCHAN020304, KSHEE020406, KKANT05, KCHENA01, KSALC05, KCHAU000103, KKLON010395, KSTEW97, KMAYO039297, KBLIN03, KPELL9697, KLSAL8797, KBIGS8797, KTATC87970203, KNORD03, KNISU8797, KTAKH039702, KWHITE858797, KGOMF050693, KAROL05, CHKAN929305, KEEK0205, KKWET01, KKANE01KISA05, KITUL9394TULU05, KSALM0206, KGEOR0205, CHKOG929305, KISTO94, KCHEE0206, KGAGA06, KIMCG94TAKW05, KTATL0205, KSALM95, KITOG9394, CHNUU92KINUS93, KIMUL94, KISTU9394, KNEK95MAIN04, KBIGCK04, KKSALM06AB, KMESH06, KMILK06, KNELS06, KBLACH06, KSTEE06, KCHIG9506, KIAYA93AYAK06, KIKAR9306, KDESH05MOOD95, CHDEC91112, KWILL05, KPRAI95, KTALA95, KCRESC06, KKILL0506, KBENJ0506, KFUNN0506, KSLIK05, KJUNE0506, KKENA0403M06, KCROCK0592, KKASL05M05, KANCH06, KNINI06, KINDI0405, KBONE0405, KCHIS04E, KOTTER05, KSINO0504, KGULK04MMFPF, KMEND04, KKIAN04, KMANK0405, KGRAY04LTON0405, KTEBA040506, KSITU90889192, KKLUK8990, CHBIG92930495, CHTAH9204, KDIPAC05, KKELS04, KKSR899093, KKING03, KCHIK0390, KICHI93LPWC05, CHWHI929805, KHUMP03, KBUTL04, KCLEA030489, KCRIP0388, KGENE030489, KKERR0304, KLPWU05, CHDMT92DEER94, KKETA8903, KBLOS04, KANDR8904, CHCRY929405, KMEDV9805, KHIDD9498, KMACA05, KKOWA8990, KLTAT899005, KUNAH8990, KNAKI8990, KDUDI05, KTAHL89, KKATEE05, KDAMD96, KKINC96, KKWINA96, KOWEG96, KBULK99, KSUST01, KECST0102, KLKALU01, KLATN96, KKITI97, KWANN96, KKLIN97, KPTCV03, KCONU9798, KMARB009699, KNITI96, KROBE9603, KSARI0197, KBIGQU96, KNANA02, KQUIN96, KMORK01, KSALBC97, KTORP01, KCHIL02959699, KNECH96, KQUES96, KSTUA96, KCLWA97, KLOUI01, KLADA96, KLTHO01, KMSHU8697, KBIRK0102039799, KHARR02, KMAKA01F03F, KFORK05, KUSKAG06SU, KSOOS04F, KLYON0203, KHANF000406, KLDES02, KCARS01, KMCKE04, KALSE04, KSIUS01, KSRF90SRS90GHF06, KEELR00001A, KSACW05
##
## 3.44% of allelic data identified as missing
sa_to_repu <- sa_ChinookSNPs %>%
group_by(indiv, collection, repunit, inferred_repunit) %>%
summarise(repu_scaled_like = sum(scaled_likelihood))
RepGroups <- unique(PopRepGroups_ChinookSNP$RepGroup) [c(9,3,5,10,6,8,4,7,2,1,11)]
sa_mean_scaled_like <- sa_to_repu %>%
mutate(repunit_f = factor(x = repunit, levels = RepGroups)) %>% #turn character into factor
mutate(inferred_repunit_f = factor(x = inferred_repunit, levels = RepGroups)) %>% #turn character into factor
group_by(repunit_f, inferred_repunit_f) %>%
summarise(mean_repu_scaled_like = mean(repu_scaled_like)) #take the mean of the scaled liklihoods
#Generate a heatmap
ggplot(sa_mean_scaled_like,aes(x = repunit_f, y = inferred_repunit_f, z = mean_repu_scaled_like)) +
geom_tile(aes(fill = mean_repu_scaled_like)) +
geom_text(aes(label=sprintf(unlist(sa_mean_scaled_like[,3]),fmt='%#.2f')))+
scale_fill_gradient(low = "white", high = "red", limits = c(0, 1), name = "Mean Scaled Likelihood\n") +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Reporting Unit") +
ylab("Inferred Reporting Unit")
This doesn’t look too bad. Most of the missallocations are to neighboring reporting groups. Chinook probably have a pretty high isolation by distance signal.
Proof100_scenarios.ChinookSNPS <- lapply(RepGroups, function(x) tibble(repunit = x, ppn = 1.0))
names(Proof100_scenarios.ChinookSNPS) <- paste("All", RepGroups, sep = "-")
Proof100_results.ChinookSNPS <- assess_reference_loo(reference = ChinookSNPBase,
gen_start_col = 5,
reps = 10,
mixsize = 100,
alpha_repunit = Proof100_scenarios.ChinookSNPS,
alpha_collection = 10) #distribution of individuals from collections ~equal
Proof100_results.ChinookSNPS %>%
mutate(repunit_f = factor(x = repunit, levels = unlist(RepGroups))) %>% #[c(2,3,6,5,4,1)]
group_by(repunit_scenario, iter, repunit_f) %>%
summarise(true_repprop = sum(true_pi), repprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n)) %>%
filter(gsub(pattern="All-","",repunit_scenario) == repunit_f) %>%
ggplot(aes(x = iter, y = repprop_posterior_mean, fill = repunit_f)) +
geom_bar(stat = 'identity') +
geom_hline(yintercept = 0.9) +
scale_fill_discrete(name = "Reporting Group") +
facet_wrap(~ repunit_f) +
xlab("Iteration") +
ylab("Posterior Mean Reporting Group Proportion")
The 100% mixtures look almost too good to be true! Let’s pull in some realistic fishery proportions from an old tech memo so that we can evaluate scenarios that we might encounter when performing GSI on the trawl bycatch.
I pulled the 2015 mixture proportions for the 2015 Bering Sea Walleye Pollock (Gadus chalcogrammus) Trawl Fishery (Guthrie III et al. 2017).
ChinookAnalyses<-read_csv("./MasterData/Chinook_2017_gsi.csv")
ChinookAnalysesM<-reshape2::melt(ChinookAnalyses,id="Analysis")
colnames(ChinookAnalysesM)[c(2,3)]<-c("RepGroup","Prop")
RF_scenarios_Chinook <- lapply(unique(ChinookAnalyses$Analysis), function(x)
ChinookAnalysesM %>%
filter(Analysis == x) %>%
ungroup()%>%
select(-1)%>%
setNames(.,c("repunit", "ppn" ))
)
names(RF_scenarios_Chinook) <- ChinookAnalyses$Analysis
RF_results_Chinook <- assess_reference_loo(reference = ChinookSNPBase,
gen_start_col = 5,
reps = 15,
mixsize = 100,
alpha_repunit = RF_scenarios_Chinook,
alpha_collection = 10) #distribution of individuals from collections ~equal
RFsum_Chinook <- RF_results_Chinook %>%
mutate(repunit_f = factor(x = repunit, levels = RepGroups))%>%#[c(2,3,6,5,4,1)]))%>%
group_by(repunit_scenario, iter, repunit_f) %>%
summarise(true_repprop = sum(true_pi), reprop_posterior_mean = sum(post_mean_pi), repu_n = sum(n)) %>%
mutate(repu_n_prop = repu_n / sum(repu_n))
ggplot(RFsum_Chinook, aes(x = repu_n_prop, y = reprop_posterior_mean, colour = repunit_f)) +
geom_point() +
scale_color_discrete(name = "Reporting Group")+
geom_abline(intercept = 0, slope = 1)+
geom_abline(intercept = 0.1, slope = 1,lty=2)+
geom_abline(intercept = -0.1, slope = 1,lty=2)+
facet_wrap(~ repunit_f)+
ylab("Posterior Mean Reporting Group Proportion") +
xlab("Simulated Mixture Proportion")
That looks pretty darn good.
DeCovich, N., T.H. Dann, S.D. Rogers Olive, H. L. Liller, E.K.C. Fox, J.R. Jasper, E.L. Chenoweth, C. Habicht, and W.D. Templin. 2012. “Chum Salmon Baseline for the Western Alaska Salmon Stock Identification Program.” Special Publication 12-26. Alaska Department of Fish; Game.
Guthrie III, C.M., Hv.T. Nguyen, A.E. Thomson, and J.R. and Guyon. 2017. “Genetic Stock Composition Analysis of the Chinook Salmon Bycatch from the 2015 Bering Sea Walleye Pollock (Gadus Chalcogrammus) Trawl Fishery.” NOAA Technical Memorandum NMFS-AFSC-342. 17109 Pt. Lena Loop Rd. Juneau, Alaska 99801: National Oceanic; Atmospheric Administration.
Guyon, J.R., C.M. Kondzela, T McCraney, C. Marvin, and E. Martinson. 2010. “Genetic Stock Composition Analysis of Chum Salmon Bycatch Samples from the 2005 Bering Sea Groundfish Fishery.” 605 W. 4th Ave, Anchorage, Alaska: North Pacific Fishery Management Council.
Kondzela, C.T. AND Vulstek, C.M. AND Marvin, and J.R. Guyon. 2013. “Genetic Stock Composition Analysis of Chum Salmon Bycatch Samples from the 2011 Bering Sea Walleye Pollock Trawl Fishery.” NOAA Technical Memorandum NMFS-AFSC-243. NOAA.
Kondzela, J.A. AND Vulstek, C.M. AND Whittle, and J.R. Guyon. 2017. “Genetic Stock Composition Analysis of Chum Salmon Bycatch Samples from the 2011 Bering Sea Walleye Pollock Trawl Fishery.” NOAA Technical Memorandum NMFS-AFSC-345. NOAA.
Larson, W.A., F.M. Utter, K.W. Myers, W.D. Templin, J.E. Seeb, C.M. Guthrie III, A.V. Bugaev, and L.W. Seeb. 2013. “Single-Nucleotide Polymorphisms Reveal Distribution and Migration of Chinook Salmon (Oncorhynchus Tshawytscha) in the Bering Sea and North Pacific Ocean.” Canadian Journal of Fisheries and Aquatic Sciences 70 (1): 128–41. doi:10.1139/cjfas-2012-0233.
Moran, B.M., and E.C. Anderson. 2019. “Bayesian Inference from the Conditional Genetic Stock Identification Model.” Canadian Journal of Fisheries and Aquatic Sciences 76 (4): 551–60. doi:10.1139/cjfas-2018-0016.
Pella, J AND Masuda M. 2001. “Bayesian Methods for Analysis of Stock Mixtures from Genetic Characters.” Fisheries Bulletin 99: 151–67.
Petrou, E.L., J.E. Seeb, L. Hauser, M.J. Witteveen, W.D. Templin, and L.W. Seeb. 2014. “Fine-Scale Sampling Reveals Distinct Isolation by Distance Patterns in Chum Salmon (Oncorhynchus Keta) Populations Occupying a Glacially Dynamic Environment.” Conservation Genetics 15 (1): 229–43. doi:10.1007/s10592-013-0534-3.
Price, M.A. 2001. “Abundance and Run Timing of Adult Chum Salmon, and Steelhead Kelt Emigration in the Sturgeon River, Kodiak, Alaska, 1999.” Alaska Fisheries Data Series 2001-2. U.S. Fish; Wildlife Service.
Seeb, L.W., W.D. Templin, S. Sato, S. Abe, K. Warheit, J.Y. Park, and J.E. Seeb. 2011. “Single Nucleotide Polymorphisms Across a Species’ Range: Implications for Conservation Studies of Pacific Salmon.” Molecular Ecology Resources 11 (SUPPL. 1): 195–217. doi:10.1111/j.1755-0998.2010.02966.x.
Templin, W.D., J.E. Seeb, J.R. Jasper, A.W. Barclay, and L.W. Seeb. 2011. “Genetic Differentiation of Alaska Chinook Salmon: The Missing Link for Migratory Studies.” Molecular Ecology Resources 11 (SUPPL. 1): 226–46. doi:10.1111/j.1755-0998.2010.02968.x.